# Load libraries
library(tidyverse)
library(sf)
library(spdep)
library(viridis)
library(readxl)
library(rnaturalearth)
library(rnaturalearthdata)
library(corrplot)
With the dataset I had, a list of the world's countries and their happiness scores, I wanted to find out which factors were most influential in determining those scores. My first step was to filter out variables that had a large number of NAs.
year Life Ladder Log GDP per capita Social support
Min. :2005 Min. :2.375 Min. : 6.457 Min. :0.2902
1st Qu.:2010 1st Qu.:4.623 1st Qu.: 8.309 1st Qu.:0.7483
Median :2013 Median :5.363 Median : 9.408 Median :0.8340
Mean :2013 Mean :5.446 Mean : 9.245 Mean :0.8111
3rd Qu.:2016 3rd Qu.:6.268 3rd Qu.:10.209 3rd Qu.:0.9046
Max. :2019 Max. :8.019 Max. :11.728 Max. :0.9873
NA's :29 NA's :13
Healthy life expectancy at birth Freedom to make life choices
Min. :32.30 Min. :0.2575
1st Qu.:58.30 1st Qu.:0.6431
Median :65.10 Median :0.7575
Mean :63.17 Mean :0.7385
3rd Qu.:68.39 3rd Qu.:0.8524
Max. :77.10 Max. :0.9852
NA's :52 NA's :31
Generosity Perceptions of corruption Positive affect
Min. :-0.33177 Min. :0.0352 Min. :0.3217
1st Qu.:-0.11718 1st Qu.:0.6927 1st Qu.:0.6233
Median :-0.02372 Median :0.8036 Median :0.7208
Mean : 0.00011 Mean :0.7491 Mean :0.7096
3rd Qu.: 0.09115 3rd Qu.:0.8737 3rd Qu.:0.8011
Max. : 0.67992 Max. :0.9833 Max. :0.9436
NA's :83 NA's :103 NA's :21
Negative affect Confidence in national government Democratic Quality
Min. :0.08343 Min. :0.06877 Min. :-2.4483
1st Qu.:0.20595 1st Qu.:0.33641 1st Qu.:-0.7885
Median :0.25577 Median :0.46479 Median :-0.2259
Mean :0.26721 Mean :0.48316 Mean :-0.1350
3rd Qu.:0.31800 3rd Qu.:0.61626 3rd Qu.: 0.6531
Max. :0.70459 Max. :0.99360 Max. : 1.5825
NA's :15 NA's :191 NA's :149
Delivery Quality Standard deviation of ladder by country-year
Min. :-2.14497 Min. :0.863
1st Qu.:-0.71114 1st Qu.:1.747
Median :-0.21663 Median :1.987
Mean :-0.00209 Mean :2.052
3rd Qu.: 0.69965 3rd Qu.:2.280
Max. : 2.18472 Max. :4.073
NA's :148
Standard deviation/Mean of ladder by country-year
Min. :0.1339
1st Qu.:0.3107
Median :0.3756
Mean :0.3970
3rd Qu.:0.4632
Max. :1.0228
GINI index (World Bank estimate)
Min. :0.240
1st Qu.:0.309
Median :0.355
Mean :0.372
3rd Qu.:0.430
Max. :0.634
NA's :1133
GINI index (World Bank estimate), average 2000-2017, unbalanced panel
Min. :0.2495
1st Qu.:0.3223
Median :0.3688
Mean :0.3853
3rd Qu.:0.4329
Max. :0.6240
NA's :180
gini of household income reported in Gallup, by wp5-year
Min. :0.2010
1st Qu.:0.3696
Median :0.4282
Mean :0.4490
3rd Qu.:0.5171
Max. :0.9614
NA's :370
Most people can be trusted, Gallup
Min. :0.0666
1st Qu.:0.1398
Median :0.1984
Mean :0.2263
3rd Qu.:0.2816
Max. :0.6403
NA's :1668
Most people can be trusted, WVS round 1981-1984
Min. :0.1765
1st Qu.:0.2903
Median :0.3802
Mean :0.3902
3rd Qu.:0.4781
Max. :0.5717
NA's :1712
Most people can be trusted, WVS round 1989-1993
Min. :0.0660
1st Qu.:0.2236
Median :0.2924
Mean :0.2838
3rd Qu.:0.3417
Max. :0.5946
NA's :1611
Most people can be trusted, WVS round 1994-1998
Min. :0.0487
1st Qu.:0.1769
Median :0.2299
Mean :0.2498
3rd Qu.:0.2942
Max. :0.6477
NA's :1181
Most people can be trusted, WVS round 1999-2004
Min. :0.0759
1st Qu.:0.1558
Median :0.2320
Mean :0.2680
3rd Qu.:0.3855
Max. :0.6372
NA's :1319
Most people can be trusted, WVS round 2005-2009
Min. :0.0382
1st Qu.:0.1435
Median :0.1984
Mean :0.2643
3rd Qu.:0.3914
Max. :0.7373
NA's :1164
Most people can be trusted, WVS round 2010-2014
Min. :0.0315
1st Qu.:0.1187
Median :0.1935
Mean :0.2372
3rd Qu.:0.3350
Max. :0.6618
NA's :1124
Country name
0
year
0
Life Ladder
0
Log GDP per capita
29
Social support
13
Healthy life expectancy at birth
52
Freedom to make life choices
31
Generosity
83
Perceptions of corruption
103
Positive affect
21
Negative affect
15
Confidence in national government
191
Democratic Quality
149
Delivery Quality
148
Standard deviation of ladder by country-year
0
Standard deviation/Mean of ladder by country-year
0
GINI index (World Bank estimate)
1133
GINI index (World Bank estimate), average 2000-2017, unbalanced panel
180
gini of household income reported in Gallup, by wp5-year
370
Most people can be trusted, Gallup
1668
Most people can be trusted, WVS round 1981-1984
1712
Most people can be trusted, WVS round 1989-1993
1611
Most people can be trusted, WVS round 1994-1998
1181
Most people can be trusted, WVS round 1999-2004
1319
Most people can be trusted, WVS round 2005-2009
1164
Most people can be trusted, WVS round 2010-2014
1124
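The per-column NA counts above can be turned into an explicit filtering rule. The following is a minimal sketch in base R on a made-up data frame; the `toy` data and the 30% cutoff (`max_na_share`) are assumptions for illustration, not the values actually used in this analysis.

```r
# Toy stand-in for the happiness data (values are made up for illustration)
toy <- data.frame(
  `Life Ladder` = c(7.1, 5.2, 4.8, 6.0, 3.9),
  `Log GDP per capita` = c(10.8, 9.1, NA, 9.9, 7.5),
  `Most people can be trusted, Gallup` = c(NA, NA, NA, 0.21, NA),
  check.names = FALSE
)

# Share of missing values in each column
na_share <- colMeans(is.na(toy))

# Hypothetical cutoff: keep only columns with at most 30% missing values
max_na_share <- 0.30
kept <- toy[, na_share <= max_na_share, drop = FALSE]

print(round(na_share, 2))
print(names(kept))
```

Working with shares instead of raw counts keeps the rule meaningful if the number of rows changes.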
After filtering out those variables, I examined the correlation matrix, which showed strong positive relationships between the happiness score (“Life Ladder”) and “Log GDP per Capita,” “Healthy Life Expectancy,” and “Social Support.” An interesting finding was that “Delivery Quality” was also highly and positively correlated with the happiness score. In addition, the negative correlations with “Perceptions of Corruption” and “gini of household income” point to the importance of effective governance and a more equal income distribution, as one might expect.
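The code behind the correlation matrix is not shown here, so the following is only a sketch of one way to compute it. It uses simulated data (the `toy` data frame and its coefficients are made up) and `use = "pairwise.complete.obs"` so that rows with NAs are dropped per pair of variables instead of across the whole table. The `corrplot` call is guarded because the package may not be installed.

```r
set.seed(1)
n <- 50
gdp <- rnorm(n, mean = 9, sd = 1)

# Simulated stand-in for a few of the numeric columns
toy <- data.frame(
  `Life Ladder` = 0.7 * gdp + rnorm(n, sd = 0.5),
  `Log GDP per capita` = gdp,
  `Perceptions of corruption` = -0.1 * gdp + rnorm(n, sd = 0.1),
  check.names = FALSE
)
toy$`Log GDP per capita`[3] <- NA  # mimic a missing value

# Pairwise-complete Pearson correlations between all numeric columns
corr_matrix <- cor(toy, use = "pairwise.complete.obs")
print(round(corr_matrix, 2))

# Optional heatmap-style plot if corrplot is available
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(corr_matrix, method = "color", type = "upper", tl.cex = 0.8)
}
```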
# Select relevant columns
happiness_data <- happiness_data %>%
select(
`year`, `Country name`, `Life Ladder`, `Log GDP per capita`, `Social support`, `Freedom to make life choices`, `Healthy life expectancy at birth`,
`Generosity`, `Perceptions of corruption`, `Democratic Quality`, `Delivery Quality`, `GINI index (World Bank estimate), average 2000-2017, unbalanced panel`,
`Confidence in national government`, `gini of household income reported in Gallup, by wp5-year`
)
happiness_data <- happiness_data %>%
rename(`gini of income` = `gini of household income reported in Gallup, by wp5-year`)
# Ensure the Life Ladder column is numeric
happiness_data <- happiness_data %>%
mutate(`Life Ladder` = as.numeric(`Life Ladder`))
# Distribution of happiness scores
ggplot(happiness_data, aes(x = `Life Ladder`)) +
geom_histogram(binwidth = 0.5, fill = "blue", alpha = 0.7) +
theme_minimal() +
labs(title = "Distribution of Happiness Scores", x = "Happiness Score", y = "Frequency")
# Boxplot of key variables to identify outliers
key_variables <- c(
"Life Ladder", "Log GDP per capita", "Delivery Quality",
"Perceptions of corruption", "gini of income"
)
happiness_data %>%
select(all_of(key_variables)) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
ggplot(aes(x = Variable, y = Value)) +
geom_boxplot(fill = "skyblue") +
theme_minimal() +
labs(title = "Boxplots of Key Variables", y = "Value")
Then, I checked whether any of these variables contained outliers that could distort the analysis. None of them showed outliers prominent enough to distort the correlation matrix.
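The visual check can be complemented with a count of flagged points. The sketch below applies the common 1.5 × IQR convention used for boxplot whiskers; `flag_outliers` is a hypothetical helper and `scores` is made-up data, neither is part of the original analysis.

```r
# Flag values lying more than 1.5 * IQR beyond the first or third quartile
flag_outliers <- function(x) {
  q <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  !is.na(x) & (x < q[1] - fence | x > q[2] + fence)
}

# Made-up scores with one obviously extreme value and one NA
scores <- c(4.6, 5.1, 5.4, 5.5, 6.3, 6.0, 5.8, 12.0, NA)

print(flag_outliers(scores))
print(sum(flag_outliers(scores)))
```

Applied column by column (for example with `sapply`), this gives a per-variable outlier count that can be compared against the boxplots.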
# Average happiness scores over time
happiness_data %>%
group_by(year) %>%
summarize(Average_Happiness = mean(`Life Ladder`, na.rm = TRUE)) %>%
ggplot(aes(x = year, y = Average_Happiness)) +
geom_line(color = "blue", linewidth = 1) + # linewidth replaces size for lines in ggplot2 >= 3.4.0
geom_point(color = "red", size = 2) +
theme_minimal() +
labs(title = "Average Happiness Scores Over Time", x = "Year", y = "Average Happiness Score")
Next, the temporal analysis showed a gradual upward trend in average happiness scores from 2005 to 2019, apart from a sudden decline between 2005 and 2006. I looked into the reason for that sudden change in more detail.
# Filter data for the year 2005
happiness_2005 <- happiness_data %>%
filter(year %in% c(2005))
# Calculate average happiness scores by country and year
average_scores <- happiness_2005 %>%
group_by(`Country name`, year) %>%
summarize(
avg_happiness = mean(`Life Ladder`, na.rm = TRUE)
) %>%
arrange(year, desc(avg_happiness)) # Sort by year and descending average happiness
# View the result
print(average_scores)
# A tibble: 27 × 3
# Groups: Country name [27]
`Country name` year avg_happiness
<chr> <dbl> <dbl>
1 Denmark 2005 8.02
2 Netherlands 2005 7.46
3 Canada 2005 7.42
4 Sweden 2005 7.38
5 Australia 2005 7.34
6 Belgium 2005 7.26
7 Venezuela 2005 7.17
8 Spain 2005 7.15
9 France 2005 7.09
10 Saudi Arabia 2005 7.08
# ℹ 17 more rows
happiness_2006 <- happiness_data %>%
filter(year %in% c(2006))
# Calculate average happiness scores by country and year
average_scores <- happiness_2006 %>%
group_by(`Country name`, year) %>%
summarize(
avg_happiness = mean(`Life Ladder`, na.rm = TRUE)
) %>%
arrange(year, desc(avg_happiness)) # Sort by year and descending average happiness
# View the result
print(average_scores)
# A tibble: 89 × 3
# Groups: Country name [89]
`Country name` year avg_happiness
<chr> <dbl> <dbl>
1 Finland 2006 7.67
2 Switzerland 2006 7.47
3 Norway 2006 7.42
4 New Zealand 2006 7.31
5 United States 2006 7.18
6 Israel 2006 7.17
7 Ireland 2006 7.14
8 Austria 2006 7.12
9 Costa Rica 2006 7.08
10 United Arab Emirates 2006 6.73
# ℹ 79 more rows
# A tibble: 10 × 2
`Country name` Average_Happiness
<chr> <dbl>
1 Denmark 7.69
2 Finland 7.57
3 Switzerland 7.55
4 Norway 7.54
5 Netherlands 7.46
6 Iceland 7.43
7 Canada 7.40
8 Sweden 7.37
9 New Zealand 7.31
10 Australia 7.29
# A tibble: 10 × 2
`Country name` Average_Happiness
<chr> <dbl>
1 Comoros 3.94
2 Zimbabwe 3.93
3 Yemen 3.91
4 Tanzania 3.69
5 Rwanda 3.65
6 Afghanistan 3.59
7 Togo 3.56
8 Burundi 3.55
9 Central African Republic 3.51
10 South Sudan 3.40
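The code that produced the two ranking tables above is not shown. Assuming they are per-country averages of the ladder score over all available years, a base R sketch on made-up data could look like this; the `toy` data frame and its column names are hypothetical.

```r
# Made-up country-year scores
toy <- data.frame(
  country = c("A", "A", "B", "B", "C", "C", "D"),
  year    = c(2018, 2019, 2018, 2019, 2018, 2019, 2019),
  ladder  = c(7.6, 7.8, 5.0, 5.4, 3.4, 3.6, 6.1)
)

# Average score per country across all years
country_avg <- aggregate(ladder ~ country, data = toy, FUN = mean)
names(country_avg)[2] <- "Average_Happiness"

# Sort from happiest to least happy
ranked <- country_avg[order(-country_avg$Average_Happiness), ]

print(head(ranked, 2))  # top of the ranking
print(tail(ranked, 2))  # bottom of the ranking
```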
After filtering for 2005 and 2006, I noticed that 2005 had far fewer observations (27 countries versus 89 in 2006) and that the countries included in 2005 were mostly ones with high happiness scores, which explains the apparent drop in the yearly average. Across all years, Denmark, Finland, and Switzerland consistently ranked near the top, while South Sudan and Afghanistan remained near the bottom, likely reflecting systemic challenges such as conflict and poverty.
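One way to check that a change in the yearly average comes from a change in which countries are covered is to count countries per year and then recompute the averages on only the countries present in both years. The sketch below uses made-up data; `toy`, `yearly`, and `balanced_means` are illustrative names.

```r
# Made-up panel: two high-scoring countries in 2005, four countries in 2006
toy <- data.frame(
  country = c("A", "B", "A", "B", "C", "D"),
  year    = c(2005, 2005, 2006, 2006, 2006, 2006),
  ladder  = c(7.5, 7.1, 7.4, 7.0, 4.2, 3.8)
)

# Number of countries and average score per year
yearly <- data.frame(
  year        = sort(unique(toy$year)),
  n_countries = as.vector(tapply(toy$country, toy$year, function(x) length(unique(x)))),
  avg_ladder  = as.vector(tapply(toy$ladder, toy$year, mean, na.rm = TRUE))
)
print(yearly)

# Restrict to countries observed in both years and compare again
both <- intersect(toy$country[toy$year == 2005], toy$country[toy$year == 2006])
balanced <- toy[toy$country %in% both, ]
balanced_means <- tapply(balanced$ladder, balanced$year, mean, na.rm = TRUE)
print(balanced_means)
```

In this toy example the raw average falls from 7.3 to 5.6, but for the countries observed in both years it only moves from 7.3 to 7.2, so most of the drop is a composition effect.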
# Scatterplot of Happiness vs. Log GDP per Capita
ggplot(happiness_data, aes(x = `Log GDP per capita`, y = `Life Ladder`)) +
geom_point(alpha = 0.7, color = "darkblue") +
geom_smooth(method = "lm", color = "red", se = TRUE) +
theme_minimal() +
labs(
title = "Happiness vs. Log GDP per Capita",
x = "Log GDP per Capita",
y = "Happiness Score"
)
# Scatterplot of Happiness vs. Delivery Quality
ggplot(happiness_data, aes(x = `Delivery Quality`, y = `Life Ladder`)) +
geom_point(alpha = 0.7, color = "green") +
geom_smooth(method = "lm", color = "red", se = TRUE) +
theme_minimal() +
labs(
title = "Happiness vs. Delivery Quality",
x = "Delivery Quality",
y = "Happiness Score"
)
# Scatterplot of Happiness vs. Gini Income
ggplot(happiness_data, aes(x = `gini of income`, y = `Life Ladder`)) +
geom_point(alpha = 0.7, color = "green") +
geom_smooth(method = "lm", color = "red", se = TRUE) +
theme_minimal() +
labs(
title = "Happiness vs. Gini Income",
x = "GINI Income",
y = "Happiness"
)
Scatterplots of happiness against the key predictors show clear patterns. The positive relationship between happiness and “Log GDP per Capita” is consistent with the importance of economic prosperity, while the negative relationship with “gini of income” suggests that greater income inequality goes along with lower happiness. Similarly, “Delivery Quality” shows a moderate to strong positive correlation with happiness, underlining the role of governance in well-being. This EDA process helped me decide which variables to focus on in the spatial analyses that follow.
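To put numbers on what the scatterplots show, the Pearson correlation of each predictor with the happiness score can be computed directly. This is a sketch on simulated data; the column names and effect sizes in `toy` are made up and do not reproduce the actual results.

```r
set.seed(42)
n <- 80

# Simulated stand-in: happiness rises with GDP and falls with inequality
toy <- data.frame(ladder = rnorm(n, mean = 5.4, sd = 1.1))
toy$log_gdp     <- 9 + 0.8 * (toy$ladder - 5.4) + rnorm(n, sd = 0.6)
toy$gini_income <- 0.45 - 0.06 * (toy$ladder - 5.4) + rnorm(n, sd = 0.08)

# Correlation of each predictor with the happiness score
predictors <- c("log_gdp", "gini_income")
r <- sapply(predictors, function(v) {
  cor(toy$ladder, toy[[v]], use = "complete.obs")
})

print(round(sort(r, decreasing = TRUE), 2))
```

Sorting the coefficients gives a quick ordering of predictors by the strength and direction of their linear association with happiness.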